
% ============================================================================

\subsection{Shared Bitmanip Extension Functionality}
\label{sec:scalar:bitmanip}

Many of the primitive operations used in symmetric key cryptography
and cryptographic hash functions are well supported by the
RISC-V Bitmanip \cite{riscv:bitmanip:repo} extension
\footnote{
At the time of writing, the Bitmanip extension is still undergoing
standardisation.
Please refer to the Bitmanip draft specification
\cite{riscv:bitmanip:draft}
directly for the
latest information, as it may be slightly ahead of what is described
here.
}.
We propose that the scalar cryptographic extension {\em reuse} a
subset of the instructions from the Bitmanip extension directly.
Specifically, this would mean that
a core implementing
{\em either}
the scalar cryptographic extensions,
{\em or}
the Bitmanip extension,
{\em or}
both,
would be required to implement these instructions.

%
% TODO: Venn diagram of proposed instructions.
%

The following subsections give the assembly syntax of instructions
proposed for inclusion in the scalar crypto extension, along with a
set of use-cases for common algorithms or primitive operations.
For information on the semantics of the instructions, we refer directly
to the Bitmanip draft specification.

\subsubsection{Rotations}
\label{sec:scalar:bitmanip:rotate}

\begin{cryptobitmanipisa}
RV32, RV64:                         RV64 only:
    ror    rd, rs1, rs2                 rorw   rd, rs1, rs2
    rol    rd, rs1, rs2                 rolw   rd, rs1, rs2
    rori   rd, rs1, imm                 roriw  rd, rs1, imm
\end{cryptobitmanipisa}

See \cite[Section 3.1.1]{riscv:bitmanip:draft} for details of
these instructions.
Standard bitwise rotation is a primitive operation in many block ciphers and
hash functions;
it features particularly in the ARX (Add, Rotate, Xor) class of
block ciphers
\footnote{\url{https://www.cosic.esat.kuleuven.be/ecrypt/courses/albena11/slides/nicky_mouha_arx-slides.pdf}}
and stream ciphers.

Algorithms making use of 32-bit rotations:
SHA256, AES (Shift Rows), ChaCha20, SM3.

Algorithms making use of 64-bit rotations:
SHA512, SHA3.

\subsubsection{Bit \& Byte Permutations}
\label{sec:scalar:bitmanip:grev}

\begin{cryptobitmanipisa}
RV32:
    rev.b   rd, rs1 // grevi rd, rs1,  7 - Reverse bits in bytes
    rev8    rd, rs1 // grevi rd, rs1, 24 - Reverse bytes in 32-bit word

RV64:
    rev.b   rd, rs1 // grevi rd, rs1,  7 - Reverse bits in bytes
    rev8    rd, rs1 // grevi rd, rs1, 56 - Reverse bytes in 64-bit word
    rev8.w  rd, rs1 // grevi rd, rs1, 24 - Reverse bytes in 32-bit words
\end{cryptobitmanipisa}

The scalar cryptography extension provides the following instructions for
manipulating the bit and byte endianness of data.
They are all parameterisations of the Generalised Reverse with Immediate
({\tt grevi}) instruction.
The scalar cryptography extension requires {\em only} the above instances
of {\tt grevi} be implemented, which can be invoked via their pseudo-ops.

Reversing bytes in words is very common in cryptography when setting a
standard endianness for input and output data.
Bit reversal within bytes is used for implementing the GHASH  component
of Galois/Counter Mode (GCM)~\cite{nist:gcm}.

Cores which also implement the Bit-manipulation extension {\em must}
implement the complete {\tt grevi} instruction, with all immediate values
supported.
The full specification of the {\tt grevi} instruction is available in
\cite[Section 2.2.2]{riscv:bitmanip:draft}.

\begin{cryptobitmanipisa}
RV32:
    zip     rd, rs1 // shfli   rd, rs1, 15 - Bit interleave
    unzip   rd, rs1 // unshfli rd, rs1, 15 - Bit de-interleave
\end{cryptobitmanipisa}

The {\tt zip} and {\tt unzip} pseudo-ops are specific instances of
the more general {\tt shfli} and {\tt unshfli} instructions.
The scalar cryptography extension requires {\em only} the above instances
of {\tt [un]shfli} be implemented, which can be invoked via their
pseudo-ops.
Only RV32 implementations require these instructions.
They perform a bit-interleave (or de-interleave) operation, and are
useful for implementing the 64-bit rotations in the
SHA3~\cite{nist:fips:202} algorithm on
a 32-bit architecture\footnote{
It is also useful for the ASCON cipher, which is a candidate in the
NIST Lightweight Cryptography competition.
}.
On RV64, the relevant operations in SHA3 can be done natively, so
{\tt zip} and {\tt unzip} are not required.

Cores which also implement the Bit-manipulation extension {\em must}
implement the complete {\tt [un]shfli} instruction, with all immediate values
supported.
The full specification of the {\tt shfli} instruction is available in
\cite[Section 2.2.3]{riscv:bitmanip:draft}.

\subsubsection{Carry-less Multiply}

\begin{cryptobitmanipisa}
RV32, RV64:
    clmul  rd, rs1, rs2
    clmulh rd, rs1, rs2
\end{cryptobitmanipisa}

See \cite[Section 2.6]{riscv:bitmanip:draft} for details of
this instruction.
As is mentioned there, obvious cryptographic use-cases for carry-less
multiply are for Galois Counter Mode (GCM) block cipher operations
\footnote{\url{https://en.wikipedia.org/wiki/Galois/Counter_Mode}}.
GCM is recommended by NIST as a block cipher mode of operation
\cite{nist:gcm}, and is the only {\em required} mode for the TLS 1.3
protocol.

See Section \ref{sec:scalar:timing} for additional implementation
requirements for this instruction, related to data independent
execution latency.

\subsubsection{Logic With Negate}

\begin{cryptobitmanipisa}
RV32, RV64:
    andn rd, rs1, rs2
     orn rd, rs1, rs2
    xnor rd, rs1, rs2
\end{cryptobitmanipisa}

See \cite[Section 2.1.3]{riscv:bitmanip:draft} for details of
these instructions.
These instructions are useful inside hash functions, block ciphers and
for implementing software based side-channel countermeasures like masking.
The {\tt andn} instruction is also useful for constant time word-select
in systems without the ternary Bitmanip {\tt cmov} instruction.

Useful for:
SHA3 Chi step,
bit-sliced function implementations
and
software based power/EM side-channel countermeasures based on masking.

\subsubsection{Packing}

\begin{cryptobitmanipisa}
RV32, RV64:                         RV64: 
    pack   rd, rs1, rs2                 packw  rd, rs1, rs2
    packu  rd, rs1, rs2                 packuw rd, rs1, rs2
    packh  rd, rs1, rs2
\end{cryptobitmanipisa}

See \cite[Section 2.1.4]{riscv:bitmanip:draft} for details of
these instructions.
Some lightweight block ciphers
(e.g., SPARX \cite{DPUVGB:16})
use sub-word data types in their primitives.
The Bitmanip pack instructions are useful for performing rotations on
16-bit data elements.
They are also useful for re-arranging halfwords within words, and
generally getting data into the right shape prior to applying transforms.
This is particularly useful for cryptographic algorithms which pass inputs
around as byte strings, but can operate on words made out of those byte
strings.
This occurs for AES when loading blocks and keys (which may not be
word aligned) into registers to perform the round functions.


\subsubsection{Crossbar Permutation Instructions}
\label{sec:xperm}

\begin{cryptobitmanipisa}
RV32, RV64:
    xperm.n rd, rs1, rs2
    xperm.b rd, rs1, rs2
\end{cryptobitmanipisa}

See \cite[Section 2.2.4]{riscv:bitmanip:draft} for a complete
description of this instruction.

The {\tt xperm.n} instruction operates on nibbles.
The \rsone register contains a vector of $\XLEN/4$ $4$-bit elements.
The \rstwo register contains a vector of $\XLEN/4$ $4$-bit indexes.
The result is each element in \rstwo replaced by the indexed element
in \rsone, or zero if the index into \rstwo is out of bounds.

The {\tt xperm.b} instruction operates on bytes.
The \rsone register contains a vector of $\XLEN/8$ $8$-bit elements.
The \rstwo register contains a vector of $\XLEN/8$ $8$-bit indexes.
The result is each element in \rstwo replaced by the indexed element
in \rsone, or zero if the index into \rstwo is out of bounds.

The instruction can be used to implement arbitrary bit
permutations.
For cryptography, they can accelerate bit-sliced implementations,
permutation layers of block ciphers, masking based countermeasures
and SBox operations.

%Figure \ref{fig:example:xperm} shows example implementations of the
%$4$-bit PRINCE SBox using the instructions.
Lightweight block ciphers using $4$-bit SBoxes include
PRESENT\cite{block:present},
Rectangle\cite{block:rectangle},
GIFT\cite{block:gift},
Twine\cite{block:twine},
Skinny, MANTIS\cite{block:skinny},
Midori \cite{block:midori}.

National ciphers using $8$-bit SBoxes include
Camellia\cite{block:camellia} (Japan), 
Aria\cite{block:aria} (Korea),
AES\cite{nist:fips:197} (USA, Belgium),
SM4\cite{block:sm4:1} (China)
and Kuznyechik (Russia).
All of these SBoxes can be implemented efficiently, in constant
time, using the {\tt xperm.b} instruction\footnote{
    \url{http://svn.clairexen.net/handicraft/2020/lut4perm/demo02.cc}
}.
Note that this technique is also suitable for masking based
side-channel countermeasures.

%\begin{figure}[h]
%\begin{lstlisting}[style=ASM]
%prince_sbox_rv64:
%    li  t0, 0x4D5E087619CA23FB  // Load the prince block cipher SBox
%    xperm.n a0, t0, a0          // a0.4[i] = t0.4[a0.4[i]]
%    ret
%
%prince_sbox_rv32:
%    li  t0, 0x4D5E0876  // Load last  8 elements of prince sbox
%    li  t1, 0x19CA23FB  // Load first 8 elements of prince sbox
%    li  t2, 0x88888888  // Bit mask for MS bits of index nibbles.
%    xperm.n a1, t1, a0  // a1.4[i] = t1.4[a0.4[i]] if a0.4[i] < 8 else 0
%    xor     a0, a0, t2  // Toggle MS bit of each nibble in input vector
%    xperm.n a0, t0, a0  // a0.4[i] = t1.4[a0.4[i]] if a0.4[i] < 8 else 0
%    or      a0, a0, a1  // Or results together.
%    ret
%\end{lstlisting}
%\caption{
%    Example implementations of the $4$-bit PRINCE\cite{block:prince}
%    block cipher SBox using the \mnemonic{xperm.n} instruction.
%}
%\label{fig:example:xperm}
%\end{figure}
